embedding generator

安装量: 47
排名: #15760

安装

npx skills add https://github.com/eddiebe147/claude-settings --skill 'Embedding Generator'

Embedding Generator The Embedding Generator skill helps you create, manage, and utilize text embeddings for semantic search, similarity matching, clustering, and classification tasks. It guides you through selecting appropriate embedding models, preprocessing text for optimal vectorization, and storing/querying embeddings efficiently. Text embeddings transform words, sentences, or documents into dense numerical vectors that capture semantic meaning. Similar concepts end up close together in vector space, enabling powerful AI applications like semantic search, recommendations, and content understanding. This skill covers everything from choosing the right model (OpenAI, Cohere, sentence-transformers, etc.) to implementing production-ready embedding pipelines with proper batching, caching, and quality validation. Core Workflows Workflow 1: Generate Embeddings for Text Corpus Analyze the text corpus: Content type (documents, sentences, queries) Average length and variation Language(s) present Domain specificity Select embedding model: Consider dimensionality vs performance tradeoff Match model to content type Evaluate cost and latency constraints Preprocess text: Clean and normalize Chunk long documents appropriately Handle special characters and formatting Generate embeddings with batching Validate quality with spot checks Store in appropriate vector database Workflow 2: Choose Embedding Model Gather requirements: Use case (search, clustering, classification) Latency requirements Cost constraints Accuracy needs Compare models: Model Dims Speed Quality Cost OpenAI text-embedding-3-small 1536 Fast Good $$ OpenAI text-embedding-3-large 3072 Fast Best $$$ Cohere embed-english-v3 1024 Fast Great $$ sentence-transformers 384-768 Varies Good Free Voyage AI 1024 Fast Great $$ Benchmark on representative samples Document decision rationale Workflow 3: Implement Embedding Pipeline Design pipeline architecture: Input preprocessing Batching strategy Error handling Caching layer Implement core components:

Example pipeline structure

def
embedding_pipeline
(
texts
)
:
cleaned
=
preprocess
(
texts
)
chunks
=
chunk_if_needed
(
cleaned
)
batches
=
create_batches
(
chunks
,
batch_size
=
100
)
embeddings
=
[
]
for
batch
in
batches
:
result
=
model
.
embed
(
batch
)
embeddings
.
extend
(
result
)
return
embeddings
Add
monitoring and logging
Test
with edge cases
Optimize
for production scale
Quick Reference
Action
Command/Trigger
Generate embeddings
"Generate embeddings for these texts"
Choose model
"Which embedding model for [use case]"
Compare models
"Compare embedding models"
Optimize pipeline
"Speed up embedding generation"
Validate quality
"Check embedding quality"
Chunk documents
"How to chunk for embeddings"
Best Practices
Match Model to Use Case
Query-document search needs asymmetric models; clustering needs symmetric
Search: Use models trained on query-passage pairs
Clustering: Use models with good sentence-level representations
Chunk Intelligently
Long texts must be chunked, but chunking strategy matters
Preserve semantic units (paragraphs, sections)
Use overlapping chunks for continuity (10-20% overlap)
Keep chunk size within model's sweet spot (typically 256-512 tokens)
Batch for Efficiency
API calls are expensive; batch aggressively
OpenAI: Up to 2048 texts per batch
Use async/concurrent processing for speed
Implement exponential backoff for rate limits
Cache Embeddings
Don't regenerate what you've already computed
Hash text to create cache keys
Store embeddings with metadata
Invalidate cache when model changes
Normalize Vectors
Cosine similarity requires normalized vectors
Most models output normalized vectors
Verify or normalize explicitly for consistency
Validate Quality
Spot-check embeddings before production use Test similarity between known-similar texts Check that distances make semantic sense Compare against baseline or ground truth Advanced Techniques Hybrid Chunking Strategy Combine semantic and size-based chunking: def hybrid_chunk ( text , max_tokens = 512 ) :

First: Split on semantic boundaries

sections

split_on_headers_paragraphs ( text )

Then: Split large sections on size

chunks

[ ] for section in sections : if token_count ( section )

max_tokens : chunks . extend ( split_with_overlap ( section , max_tokens ) ) else : chunks . append ( section ) return chunks Query Expansion for Better Retrieval Generate multiple query embeddings for robust search: Original: "machine learning frameworks" Expanded: [ "machine learning frameworks", "ML libraries and tools", "deep learning software", "AI development platforms" ] Dimensionality Reduction When storage or speed is critical: - PCA: Fast, linear reduction - UMAP: Preserves local structure - Matryoshka embeddings: Models with variable-size outputs Cross-Lingual Embeddings For multilingual applications: - Use multilingual models (mBERT, XLM-R, Cohere multilingual) - Translate queries to embedding language - Align embedding spaces post-hoc Common Pitfalls to Avoid Using the wrong model type (asymmetric vs symmetric) for your use case Chunking in ways that break semantic meaning (mid-sentence, mid-paragraph) Not accounting for rate limits in production systems Storing embeddings without metadata needed for filtering Regenerating embeddings unnecessarily (implement caching) Mixing embeddings from different models in the same index Ignoring the impact of text preprocessing on embedding quality

返回排行榜